Abstract


We report recognition results for several pattern classifiers trained and tested on disjoint sets totalling 30620 digits
selected from the first 500 writers of NIST Special Database 3. The classifiers are drawn from the traditional
pattern recognition literature (minimum distance, maximum a posteriori, nearest neighbor) as well as the neural
network literature (multilayer perceptron, radial basis functions, probabilistic neural network). To allow a
valid comparison of the classifiers, fixed sets of Karhunen-Loeve transform coefficients were used as features. These were
produced from images preprocessed using fixed methods for size and orientation normalization. The “K-means”
clustering algorithm is used to produce subclasses, thereby supervising training and aiding recognition.
Graphical  displays  of  classification  and  associated  confidences  illustrate  classifier  complexity.  Recognition  error 
rates  for  all  the  classifiers  are  tabulated  as  a function  of  feature  vector  dimension.  Computational  and  memory 
requirements  of  the  different  classifiers  are  also  compared. 


1 Introduction 

Optical Character Recognition (OCR) has been a popular focus of Pattern Recognition research since at least the
1960s. The ready availability of image samples and the continuing challenge of commercially viable recognition
have meant that OCR research is ongoing. However, classification of loosely constrained handwritten digits, at
least, is essentially a solved problem [1].

A good  review  of  OCR  is  found  in  [2].  The  huge  quantity  of  research  from  academia  and  industry  has  yielded 
a multitude  of  algorithms  for  normalization  [3]  [4],  feature  extraction  [5],  and  classification  [6]  [7]  [8]  [9],  that 
are capable of digit OCR. Interest in OCR research was sustained by the advent of neural network
paradigms applicable to feature extraction and classification. The advantage of many neural network classifiers,
once trained, is their efficiency. Despite advances in computational resources, future commercial segmentation
and recognition efforts [10] [11] will be precluded from using numerous techniques from the literature because of
their computational requirements. The trade-off between classification performance and computational
requirements has prompted this study of digit classifier efficacy. See Kimura et al. [12]
for a similar survey.


2 NIST  OCR  Databases 


The classifiers described in this report were trained and tested using feature vectors derived from the digit images
of NIST Special Database 3 [13]. This database consists of binary 128 by 128 pixel raster images segmented from
the sample forms of 2100 writers published on CD as [14]. External results on segmentation and recognition of
this database have been reported [15]. The relative difficulties of the NIST OCR databases have been discussed
in [16]. For this study, samples are drawn randomly from the first 250 writers to yield a training set of 7480 digits
(748 per class) with a priori class probabilities all equal to 0.1. Even for digits, depending on the application, certain
classes may be more prevalent; in banking tasks, for example, “0” is more common. The test set is similarly constructed from
the second 250 writers, yielding 23140 samples (2314 per class). The images are size normalized by pixel deletion, stroke width is
bounded by binary erosion and dilation, and consistent orientation is effected by row shearing.

Figure 1: Components of Classification System


3 Components  of  Classification  System 

Each  experimental  OCR  classifier  notionally  comprises  the  modules  of  Figure  1.  The  ovals  indicate  inputs  and 
outputs,  the  rectangles  represent  processing  modules,  and  the  arrows  show  logical  procedure.  A 32  pixel  square 
raster  contains  the  normalized  digit  image.  The  feature  extraction  module  linearly  transforms  the  binary  data 
yielding reduced dimensionality classifiable features. The next module is the bank of “Discriminant Functions”.
There are as many discriminant functions as there are clusters in the training set; for several classifiers more
than one cluster per class was used. Each function maps an n-dimensional extracted feature vector to a scalar
that takes a large value if the unknown input belongs to the cluster corresponding to that particular discriminant
function. The values produced by the bank of discriminant functions are sent, finally, to both the “Class Finder”
and the “Rejector”. The class finder infers the hypothesized class from the discriminant values associated with each
cluster, yielding the classifier’s best guess of the class. The rejector module computes a “confidence function”
of the discriminant values, compares the result with a specified threshold, and thereby indicates whether the
unknown should be accepted or rejected. Rejection implies that the hypothesized class is not trustworthy; such
examples are either ignored or reclassified more robustly.


4 Feature  Extractor 

The normalized input images are $2^5$ pixels high and the width is less than or equal to the height. Raster
dimensionality is $2^{10}$. A lower dimensionality feature vector is obtained as the incomplete Karhunen-Loeve (K-L)
transformation of the image, typically of useful dimension less than $2^6$. The K-L transformation is an orthonormal
function and corresponds to projection of the images onto the eigenvectors of the covariance matrix of the image
data. (The production of this transform is also known as principal factors or principal components analysis.) The
covariance matrix is diagonalized (using, for example, EISPACK serial Fortran routines) producing the largest
eigenvalues and corresponding eigenvectors. Feature extraction is thus the application of an affine function to
the image; the mean image vector is subtracted and the result is premultiplied by the matrix whose rows are the
eigenvectors. Thus each element of the K-L transform is the projection of the image onto a basis vector. The
sample variance of that element is the respective eigenvalue, which is therefore a measure of its “size”. Only 64 of
the eigenvectors were retained, with eigenvalues ranging from 72.9 down to 0.9. Figure 2 shows the eigenvalues on
a log scale; their rapid decline implies that the information content of the K-L features is concentrated in the first
few features. This variance ordering of the features provides a consistent method for reducing the dimensionality
of the feature sets, the less variant coefficients being discarded first. Figure 2 also shows the leading eigenvectors.

Figure 2: Eigenvalues and Eigenvectors. The latter are shown in eigenvalue order top to bottom then left to right.
Note the central support on the x-axis for the basis functions due to the tight constraints imposed by the size
normalization. It is visually apparent that several eigenvectors will have large correlations and anti-correlations
with several digit classes.
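The feature extraction step described above (subtract the mean image, then project onto the leading covariance eigenvectors) can be sketched as follows. This is a minimal illustration, not the EISPACK-based procedure used for the reported results: the eigenvectors are found by power iteration with deflation, and the function names `kl_basis` and `kl_features`, the start vector, and the toy data are all assumptions made for the example.

```python
import math

def kl_basis(samples, n_features, iters=200):
    """Estimate the mean and the leading eigenpairs of the sample covariance."""
    m, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / m for k in range(d)]
    centered = [[s[k] - mean[k] for k in range(d)] for s in samples]
    cov = [[sum(c[a] * c[b] for c in centered) / (m - 1) for b in range(d)]
           for a in range(d)]
    eigenvalues, eigenvectors = [], []
    for _ in range(n_features):
        # Fixed, non-uniform start vector (an arbitrary choice for this sketch).
        v = [1.0 / (k + 1) for k in range(d)]
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm == 0.0:
                break
            v = [t / norm for t in w]
        # Rayleigh quotient gives the eigenvalue, i.e. the variance of this feature.
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        eigenvalues.append(lam)
        eigenvectors.append(v)
        # Deflation removes the component just found so the next pass finds the next one.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, eigenvalues, eigenvectors

def kl_features(x, mean, eigenvectors):
    """Subtract the mean and project onto each retained eigenvector."""
    return [sum(v[k] * (x[k] - mean[k]) for k in range(len(x))) for v in eigenvectors]

# Toy data: most of the variance lies along the first coordinate.
samples = [(-1.0, 0.0), (1.0, 0.0), (0.0, -0.5), (0.0, 0.5)]
mean, eigenvalues, eigenvectors = kl_basis(samples, 2)
features = kl_features((1.0, 0.0), mean, eigenvectors)
```

Because the eigenvalues come out in decreasing order, truncating `eigenvectors` to its first few entries discards the least variant coefficients first, which is the dimensionality reduction used throughout the comparison.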


Figure 3 shows locations of the 748 training examples per class represented using the first two K-L features.
Although graphical representation of high dimensional spaces is a perennial problem for pattern recognition,
it is apparent that two features are insufficient to separate all classes. Despite “0”s and “1”s, and “6”s and “9”s
separating reasonably, no classifier, in two dimensions, achieved better than 50% test set error.

Only one type of feature has been considered for this comparison of classifiers. Although many other feature types
are known to be classifiable for OCR [17] [18], the Karhunen-Loeve transform is, among the unitary transforms,
an optimally compact signal representation of the original data. These features have the further benefit that
reconstruction of the images is possible, and the K-L transform is optimal, for n coefficients of a unitary transform,
in the mean square error between reconstruction and original. The variance ordering is useful for comparing classifiers
at reduced dimensionalities.


5 Feature  Clustering 

For several classifiers described below it is well known [19] that splitting class prototypes into clusters yields
improved performance. Indeed clustering algorithms can be used for unsupervised classification [20]. One readily
available method is the “K-means” algorithm [21] [22]. The examples are iteratively split into clusters as follows.
The first iteration starts with the cluster of all prototypes and computes the distances from each to the cluster
center. The second iteration divides the cluster into two, moving cases from one to the other until no further
movements decrease the distances between each case and the center of its assigned cluster. For subsequent
iterations, the cluster with the largest variance is split and its prototypes are assigned to the cluster whose mean
is least distant. The means are then updated and reassignment continues until no movements are made for an
iteration. It is important to note that this distance clustering is classless, so that clusters could be formed from
neighbors of different classes. Supervision is enforced by application of the algorithm to each class independently.

Figure 3: Locations by class of the training samples. Compare these with the EMD classification map of Figure
4. The horizontal and vertical axes correspond to the first and second eigenvectors respectively.
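The splitting procedure and its per-class application can be sketched as below. The text does not say how a cluster is divided, so this sketch assumes the new cluster is seeded at the member farthest from the mean of the cluster being split; that rule, and the names `split_kmeans` and `cluster_by_class`, are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_of(points):
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]

def split_kmeans(points, n_clusters, max_iters=100):
    """Grow from one cluster to n_clusters, splitting the highest-variance cluster each time."""
    means = [mean_of(points)]
    assign = [0] * len(points)
    while len(means) < n_clusters:
        variances = []
        for c in range(len(means)):
            members = [p for p, a in zip(points, assign) if a == c]
            # Empty clusters get -1 so they are never selected for splitting.
            variances.append(sum(sq_dist(p, means[c]) for p in members) / len(members)
                             if members else -1.0)
        worst = variances.index(max(variances))
        members = [p for p, a in zip(points, assign) if a == worst]
        # Assumed split rule: seed the new cluster at the member farthest from the mean.
        means.append(list(max(members, key=lambda p: sq_dist(p, means[worst]))))
        for _ in range(max_iters):
            new_assign = [min(range(len(means)), key=lambda c: sq_dist(p, means[c]))
                          for p in points]
            if new_assign == assign:
                break  # no prototype moved during this iteration
            assign = new_assign
            for c in range(len(means)):
                members = [p for p, a in zip(points, assign) if a == c]
                if members:
                    means[c] = mean_of(members)
    return means

def cluster_by_class(samples, labels, clusters_per_class):
    """Supervision: run the classless clustering on each class independently."""
    prototypes = []
    for cls in sorted(set(labels)):
        pts = [s for s, lab in zip(samples, labels) if lab == cls]
        prototypes.extend((m, cls) for m in split_kmeans(pts, clusters_per_class))
    return prototypes

means = split_kmeans([(0, 0), (0, 1), (10, 0), (10, 1)], 2)
```

Running the clustering per class guarantees that every cluster mean carries a single class label, which is what the discriminant banks in the following sections rely on.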


6 Classifiers  and  Discriminant  Functions 

Each classifier consists of a bank of discriminant functions. The classifiers are separated into three categories. It
is, however, notable that the category names are somewhat arbitrary and that some classifiers have attributes
of more than one category. In the statistical pattern recognition literature [23] the Parametric classifiers use
variables such as the expected means and covariances to express the class density functions. In assuming, for
example, linear and quadratic forms for our discriminant functions we categorize simple Euclidean Minimum
Distance (EMD), the more advanced Quadratic Minimum Distance (QMD) and the Normal (NRML) classifier
as parametric classifiers. The Non-Parametric classifiers do not adopt a structured expression of the density
functions; two Nearest Neighbor classifiers, the popular K-NN and an improvement termed Weighted Several
Nearest Neighbors (WSNN), are considered. Finally, the Neural Net category contains the Multi-Layer Perceptron
(MLP), Radial Basis Functions classifiers of two types (RBF1 and RBF2), and the Probabilistic Neural Net (PNN).

For  each  type  of  discriminant  function,  one  or  more  diagrams  are  provided  showing  the  resulting  hypothetical 
class  regions  in  two-dimensional  feature  space.  These  diagrams  show  the  hypothesized  classifications  of  regularly 
spaced  feature  vectors  sampled  over  the  square  region  centered  on  (0,0)  and  with  extent  large  enough  to  contain 
the  training  vectors.  Restriction  of  this  graphical  representation  to  two  dimensions  is  undeniably,  but  necessarily, 
not  ideal.  These  class  maps  should  be  compared  with  the  real  distributions  of  figure  3. 

6.1  Notation 

The  notation  below  will  be  used  in  the  descriptions  of  the  discriminant  functions. 

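$L$ = number of classes. For digits, $L = 10$
$N$ = number of clusters, $N \geq L$
$n$ = dimensionality of features
$\mathbb{R}^n$ = the set of all $n$-tuples of real numbers = “feature space”
$\mathbf{x}$ = extracted “feature vector” of a digit ($\mathbf{x} \in \mathbb{R}^n$)
$M_i$ = number of training prints of cluster $i$ ($1 \leq i \leq N$)
$\mathbf{x}_j^{(i)}$ = $j$th feature vector from digit of cluster $i$ ($1 \leq j \leq M_i$) ($\mathbf{x}_j^{(i)} \in \mathbb{R}^n$)
$\mu_i$ = mean feature vector for cluster $i$ ($1 \leq i \leq N$) ($\mu_i \in \mathbb{R}^n$)
$\mathbf{m}_i$ = an estimate of $\mu_i$
$\Sigma_i$ = covariance matrix for cluster $i$ ($1 \leq i \leq N$) ($\Sigma_i \in \mathbb{R}^{n \times n}$)
$\mathbf{S}_i$ = an estimate of $\Sigma_i$
$d^2(\mathbf{x}, \mathbf{y})$ = $(\mathbf{x} - \mathbf{y})^T(\mathbf{x} - \mathbf{y})$ = squared Euclidean distance between $\mathbf{x}$ and $\mathbf{y}$ ($\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$)
$r^2(\mathbf{x}, \mathbf{y}, \mathbf{z})$ = $\sum_{i=1}^{n} ((x_i - y_i)/z_i)^2$ = distance between $\mathbf{x}$ and $\mathbf{y}$ normalized by $\mathbf{z}$ ($\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$)
$D_i(\mathbf{x})$ = discriminant function ($1 \leq i \leq N$, $\mathbf{x} \in \mathbb{R}^n$)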


6.2  Parametric  Classifiers 

6.2.1  Euclidean  Minimum  Distance  (EMD)  Classifier 

This  is  perhaps  one  of  the  simplest  classifiers  that  one  can  design.  Its  discriminant  functions  are  of  the  form 

$D_i(\mathbf{x}) = -d^2(\mathbf{x}, \mathbf{m}_i)$.

An  unknown  is  assigned  the  class  associated  with  the  cluster  of  the  highest-valued  discriminant  function.  This  is 
equivalent  to  using  the  class  label  of  the  estimated  cluster-mean  that  is  closest,  in  the  Euclidean  distance  sense,  to 
the  unknown.  In  the  one  cluster  per  class  case  the  hypothetical  class  regions  are  convex  polygons.  This  classifier  is 
essentially  the  Perceptron  (although  the  actual  boundaries  may  be  different)  whose  linear  separability  limitations 
were described by Minsky and Papert [24]. Figure 4 shows the class regions when only two features are used.
The estimated cluster mean vectors $\mathbf{m}_i$ are marked with plus signs.
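As a minimal sketch of the EMD rule, the discriminant bank is the negated squared distance to each cluster mean, and the hypothesis is the class label of the highest-valued discriminant. The function names and the toy means are assumptions for illustration.

```python
def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def emd_discriminants(x, cluster_means):
    """One discriminant per cluster: D_i(x) = -d^2(x, m_i)."""
    return [-sq_dist(x, m) for m in cluster_means]

def emd_classify(x, cluster_means, cluster_labels):
    """Return the class label of the cluster whose mean is closest to x."""
    values = emd_discriminants(x, cluster_means)
    return cluster_labels[values.index(max(values))]

# Two clusters for class 0 and one for class 1 (toy values).
cluster_means = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
cluster_labels = [0, 1, 0]
hypothesis = emd_classify((3.0, 1.0), cluster_means, cluster_labels)
```

With more than one cluster per class, a class region becomes a union of the convex cells of its clusters, so it need no longer be convex.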


6.2.2 Quadratic Minimum Distance (QMD) Classifier

The training examples of each cluster $i$ are used to produce sample covariance matrices, $\mathbf{S}_i$, and estimated mean
vectors $\mathbf{m}_i$. The following discriminants are used:

$D_i(\mathbf{x}) = -(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i) = -\mathbf{z}^T \mathbf{z}, \qquad \mathbf{z} = \Lambda_i^{-1/2} \Psi_i^T (\mathbf{x} - \mathbf{m}_i)$,

where the columns of $\Psi_i$ are the eigenvectors of $\mathbf{S}_i$ and $\Lambda_i$ is the diagonal matrix of the corresponding eigenvalues.
That is, the cluster mean, $\mathbf{m}_i$, is first subtracted from the unknown, the result is projected onto the eigenvectors
of the cluster $i$ covariance matrix, and each component is finally whitened by dividing it by the root of the
corresponding eigenvalue $\lambda$. This can be thought of as a form intermediate between EMD and the Normal
(NRML) classifier, described below. Figure 4 shows the resulting class regions; the boundary between any two
clusters is quadratic. When the number of clusters per class increases, the inverse covariance matrix for a given
cluster is formed from a decreasing number of training examples. Computational difficulties occur when the
number of cluster examples forming the covariance is small. The rank of $\mathbf{S}_i$ may then be less than $n$, so that $\mathbf{S}_i$ is
singular and its conventional inverse cannot be evaluated. It should be noted that QMD is not a true
parametric classifier in the sense that the estimated conditional density functions do not integrate to unity. If
this condition is forced then the normal classifier results.

Figure 4: Parametric Classifiers. For EMD note the class boundaries are the perpendicular bisectors of segments
connecting pairs of class means. For QMD and NRML note the quadratic forms of the decision boundaries. The
+ signs indicate the locations of the estimated class means.
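The equivalence between the quadratic form and the whitened projection can be checked in a few lines. The symbols $\Psi_i$ (matrix whose columns $\psi_k$ are the eigenvectors of $\mathbf{S}_i$) and $\Lambda_i$ (diagonal matrix of eigenvalues $\lambda_k$) are notation assumed here for the derivation.

```latex
\begin{align*}
\mathbf{S}_i &= \Psi_i \Lambda_i \Psi_i^T, \qquad \Psi_i^T \Psi_i = \mathbf{I}
  \quad\Longrightarrow\quad \mathbf{S}_i^{-1} = \Psi_i \Lambda_i^{-1} \Psi_i^T, \\
(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i)
  &= \left(\Lambda_i^{-1/2} \Psi_i^T (\mathbf{x} - \mathbf{m}_i)\right)^T
     \left(\Lambda_i^{-1/2} \Psi_i^T (\mathbf{x} - \mathbf{m}_i)\right)
   = \mathbf{z}^T \mathbf{z}
   = \sum_{k=1}^{n} \frac{\left(\psi_k^T (\mathbf{x} - \mathbf{m}_i)\right)^2}{\lambda_k}.
\end{align*}
```

The final sum makes the singular case visible: if some $\lambda_k = 0$, which happens when a cluster has too few examples, the corresponding term is undefined, and this is the computational difficulty noted above.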


6.2.3  Normal  (NRML)  Classifier 


This  classifier  is  based  on  parametric  density  estimation  that  presupposes  a multivariate  normal  distribution  for 
each  class.  First,  it  will  be  useful  to  mention  a few  facts  that  pertain  to  any  parametric  classifier,  using  the 
following  terminology: 


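$p(i)$ = a priori probability of cluster $i$
$\lambda(i|j)$ = loss incurred by classifying to $i$ a print that is of cluster $j$ ($1 \leq i, j \leq N$)
$p(\mathbf{x})$ = mixture density: for $S \subset \mathbb{R}^n$, $\int_S p(\mathbf{x})\,d\mathbf{x} = P(\mathbf{x} \in S)$
$p(\mathbf{x}|i)$ = conditional density: for $S \subset \mathbb{R}^n$, $\int_S p(\mathbf{x}|i)\,d\mathbf{x} = P(\mathbf{x} \in S \mid \mathbf{x}$ is from a cluster-$i$ print$)$
$p(i|\mathbf{x})$ = a posteriori probability: for a particular $\mathbf{x}$, $p(i|\mathbf{x}) = P(\mathbf{x}$ is from a cluster-$i$ print$)$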


Given a particular loss function $\lambda(i|j)$, the optimal or “Bayesian” classifier is the one that minimizes the expected
loss. Define the “symmetric” loss function in terms of the Kronecker delta:

$\lambda(i|j) = 1 - \delta_{ij} = \begin{cases} 0 & i = j \\ 1 & \text{otherwise} \end{cases}$


This means that correct classifications produce no losses and that all kinds of incorrect classifications produce
equal loss values of 1 unit. In this case, the Bayesian classifier is the one that classifies each unknown $\mathbf{x}$ to the
cluster $i$ for which the a posteriori probability $p(i|\mathbf{x})$ is highest. According to Bayes’s rule [25],

$p(i|\mathbf{x}) = \frac{p(i)\, p(\mathbf{x}|i)}{p(\mathbf{x})}$.

Since the value of the mixture density $p(\mathbf{x})$ has no effect on which $i$ maximizes $p(i|\mathbf{x})$, $p(\mathbf{x})$ may be
disregarded. Also, for a pattern recognition problem in which the a priori probabilities are all the same, the $p(i)$
can be ignored. The result is to classify $\mathbf{x}$ to the cluster $i$ for which $p(\mathbf{x}|i)$ is highest. For the Normal classifier
each cluster, $i$, is assumed to have conditional density function

$p(\mathbf{x}|i) = (2\pi)^{-n/2}\, |\Sigma_i|^{-1/2} \exp\left(-\tfrac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i)\right)$,

where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix for cluster $i$. For classification the $(2\pi)^{-n/2}$ term is
constant and may be discarded. Finally, by replacing the mean vectors $\mu_i$ and covariance matrices $\Sigma_i$ with their
sample estimates, $\mathbf{m}_i$ and $\mathbf{S}_i$, squaring, and taking logarithms, the discriminant function for the Normal classifier
becomes

$D_i(\mathbf{x}) = -\log|\mathbf{S}_i| - (\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i)$.

Figure 5: Single Nearest Neighbor Classifier (1-NN). Note the very intricate non-contiguous decision boundaries local to
each training prototype.

The hypothetical class regions are given in Figure 4. The location of the means in that figure, indicated by the +
signs, shows that if $\mathbf{x} = \mathbf{m}_i$ then, in two dimensions, misclassification can result solely because of the determinant
terms.
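The effect of the determinant term can be illustrated with a two-dimensional sketch of the Normal discriminant, written as minus the log-determinant minus the quadratic form. The two clusters below are invented for the example: a broad cluster and a nearby narrow one. The function name and the closed-form 2 by 2 inverse are choices made for this sketch.

```python
import math

def normal_discriminant(x, mean, cov):
    """D(x) = -log|S| - (x - m)^T S^{-1} (x - m) for a 2x2 covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -math.log(det) - quad

broad_mean, broad_cov = (0.0, 0.0), ((4.0, 0.0), (0.0, 4.0))
narrow_mean, narrow_cov = (1.0, 0.0), ((1.0, 0.0), (0.0, 1.0))

# Evaluate both discriminants exactly at the mean of the broad cluster.
d_broad = normal_discriminant(broad_mean, broad_mean, broad_cov)
d_narrow = normal_discriminant(broad_mean, narrow_mean, narrow_cov)
```

Here the quadratic term of the broad cluster is zero at its own mean, yet its determinant penalty of log 16 exceeds the narrow cluster's quadratic penalty of 1, so the unknown is assigned to the narrow cluster.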

6.3  Nearest  Neighbor  Classifiers 

Nearest-neighbor  classifiers  have  been  the  subject  of  decades  of  research  (see,  for  example,  Dasarathy’s  collection 
of  papers  [36]).  The  following  are  simple  and  ubiquitous  yet  effective  examples  of  such  methods. 


6.3.1 k-Nearest Neighbor (k-NN)

If $k = 1$, this is an elaboration of EMD; instead of using just $\mathbf{m}_i$ as a single prototype for the class, the 1-NN
classifier uses all of the class-$i$ training examples as prototypes for the class. The 1-NN classification of an
unknown vector is simply the class of the nearest prototype. This rule is intuitively appealing, and Cover and
Hart [6] have shown it to have good asymptotic behavior: under mild assumptions, its large-sample probability of
error is bounded above by twice the Bayes (i.e. minimum possible) probability of error. The 1-NN discriminant
functions have the form:

$D_i(\mathbf{x}) = -\min_{1 \leq j \leq M_i} d^2(\mathbf{x}, \mathbf{x}_j^{(i)})$

Figure 5 shows the class regions. Each region is the union of many convex polygons each containing a single
prototype of the class; hence, a class region is a very complicated polygon, not necessarily convex or even
connected. In the more general case voting between the $k$ nearest neighbors is used, and the majority class is taken
as the hypothesis. The method is useful near class boundaries when the single nearest neighbor may be of the
wrong class but the majority are not. If $S_{\mathbf{x}}$ is the set of the $k$ closest prototypes voting on the class of $\mathbf{x}$, then it
is the union of the sets of voting prototypes $S_{\mathbf{x}}^{(i)}$, each containing only prototypes of class $i$. The k-NN discriminant
function is then simply the set size:

$D_i(\mathbf{x}) = |S_{\mathbf{x}}^{(i)}|$.
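A minimal sketch of the voting rule follows: the discriminant for each class is the number of the k closest prototypes that belong to it. The function name, the integer class labels, and the toy prototypes are assumptions for illustration; ties in distance are broken by prototype order.

```python
def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def knn_discriminants(x, prototypes, labels, k, n_classes):
    """Return one vote count per class among the k nearest prototypes."""
    order = sorted(range(len(prototypes)), key=lambda j: sq_dist(x, prototypes[j]))
    votes = [0] * n_classes
    for j in order[:k]:
        votes[labels[j]] += 1
    return votes

# One class-1 prototype sits inside a region otherwise surrounded by class 0.
prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.4, 0.4)]
labels = [0, 0, 0, 1, 1]
unknown = (0.5, 0.5)

votes_k1 = knn_discriminants(unknown, prototypes, labels, 1, 2)
votes_k3 = knn_discriminants(unknown, prototypes, labels, 3, 2)
```

With k = 1 the single stray class-1 prototype decides the outcome, while with k = 3 the two class-0 neighbors outvote it, which is the boundary behavior described above.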


6.3.2 Weighted Several Nearest Neighbors (WSNN)

A more elaborate form of the nearest neighbor method is to allow $k$ to be a random variable such that the number
of voting neighbors is different for each unknown. This classifier finds the closest prototype to the unknown, then
defines the “neighboring” prototypes to be those whose squared Euclidean distance from the unknown is no more than
$\alpha$ times the squared distance of the nearest prototype, where $\alpha$ is a constant. Further, the number of “votes”
received by class $i$ is divided by the square root of the sum of squared distances of class-$i$ near neighbors from the
unknown, so as to diminish the importance of neighbors that are relatively far away compared to other neighbors.
Formally,

$\alpha$ = neighborhood-size factor
$S_{\mathbf{x}}^{(i)}$ = the set of indices of class-$i$ training vectors that are in the $\alpha$-neighborhood of unknown vector $\mathbf{x}$
 = $\{\, j : 1 \leq j \leq M_i,\ d^2(\mathbf{x}, \mathbf{x}_j^{(i)}) \leq \alpha \min_{1 \leq k \leq N,\ 1 \leq p \leq M_k} d^2(\mathbf{x}, \mathbf{x}_p^{(k)}) \,\}$
$|S_{\mathbf{x}}^{(i)}|$ = number of “votes” for class $i$

The discriminant functions are then

$D_i(\mathbf{x}) = \begin{cases} |S_{\mathbf{x}}^{(i)}| \Big/ \sqrt{\sum_{j \in S_{\mathbf{x}}^{(i)}} d^2(\mathbf{x}, \mathbf{x}_j^{(i)})} & \text{if } S_{\mathbf{x}}^{(i)} \text{ is nonempty} \\ 0 & \text{otherwise} \end{cases}$

Figure 6 shows the WSNN class regions resulting from $\alpha$ values of 50, 500 and 1000.

Figure 6: Weighted Several Nearest Neighbors. In the limit of small $\alpha$ this classifier defaults to 1-NN. Note the
fine grained structure throughout that is typical of nearest neighbor methods.

6.4 Neural Net Classifiers

6.4.1 Multi-Layer Perceptron (MLP)

This classifier is also known as a feedforward neural net. We have used an MLP with three layers (counting the
inputs as a layer). It will be convenient to define the following notation:

$N^{(i)}$ = number of nodes in the $i$th layer ($i = 0, 1, 2$); $N^{(0)} = n$, $N^{(2)} = L$
$f(x) = 1/(1 + e^{-x})$ = sigmoid function
$b_i^{(k)}$ = bias weight of the $i$th node of the $k$th layer ($k = 1, 2$; $1 \leq i \leq N^{(k)}$)
$w_{ij}^{(k)}$ = weight connecting the $i$th node of the $k$th layer to the $j$th node of the
$(k-1)$th layer ($k = 1, 2$; $1 \leq i \leq N^{(k)}$, $1 \leq j \leq N^{(k-1)}$)

The discriminant functions are then of the form

$D_i(\mathbf{x}) = f\left( b_i^{(2)} + \sum_{j=1}^{N^{(1)}} w_{ij}^{(2)}\, f\left( b_j^{(1)} + \sum_{k=1}^{N^{(0)}} w_{jk}^{(1)} x_k \right) \right)$.
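The forward computation of the three-layer network can be sketched as follows, assuming a sigmoid at both the hidden and the output nodes. The argument names `w1`, `b1`, `w2`, `b2` (weight rows and bias lists for the hidden and output layers) and the toy weights are hypothetical.

```python
import math

def sigmoid(t):
    """f(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def layer(inputs, weights, biases):
    """Each node adds its bias to a weighted sum of the inputs, then applies f."""
    return [sigmoid(b + sum(w * v for w, v in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def mlp_discriminants(x, w1, b1, w2, b2):
    """Discriminant values: output layer applied to the hidden layer activations."""
    hidden = layer(x, w1, b1)
    return layer(hidden, w2, b2)

# Toy network: 2 inputs, 2 hidden nodes, 2 outputs.
w1, b1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0], [-1.0, -1.0]], [0.0, 0.0]
outputs = mlp_discriminants([1.0, 2.0], w1, b1, w2, b2)
```

Each output lies strictly between 0 and 1, so the values can be compared directly with target strings of 1's and 0's during training.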


For the training of the weights of this network, a reasonable procedure is the use of an optimization algorithm to
minimize the mean-squared-error, over the training set, between the discriminant values actually produced and
“target discriminant values” consisting of the appropriate strings of 1’s and 0’s as defined by the actual classes of
the training examples. For example, if a training feature vector is of class 2, then its target vector of discriminant
values is set to (0,1,0,0,0,0,0,0,0,0). It is more feasible to minimize this kind of “error function” than to attempt
to directly minimize the number of incorrectly classified training examples, since the latter number will take on
only relatively few values and is a discontinuous “step function”. The error function is modified by addition of a
scalar “regularization” term [26]. This equals a tunable constant, $\lambda$, multiplied by the mean square weight.

This term penalizes large weights, which are associated with overtraining, i.e. the overfitting of the weights to the
training data. This has been shown to increase the generalization ability of the network [27].

Figure 7: MLP Classification and Confidence Maps. From left: class boundaries, highest discriminant value,
difference in highest two discriminant values.
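The regularized error function can be sketched as the mean squared error plus a constant times the mean square weight. The exact normalization is not stated in the text, so this sketch assumes the squared error is averaged over every output of every training example; the names `target_vector`, `regularized_error` and `lam` are hypothetical.

```python
def target_vector(position, n_classes=10):
    """Target discriminant values: 1 at the given zero-based position, 0 elsewhere."""
    return [1.0 if i == position else 0.0 for i in range(n_classes)]

def regularized_error(outputs, targets, weights, lam):
    """Mean squared error over all outputs plus lam times the mean square weight."""
    n_terms = sum(len(t) for t in targets)
    mse = sum((o - t) ** 2
              for out, tgt in zip(outputs, targets)
              for o, t in zip(out, tgt)) / n_terms
    mean_square_weight = sum(w * w for w in weights) / len(weights)
    return mse + lam * mean_square_weight

# A perfectly fitted example still pays the weight penalty.
error = regularized_error([[1.0, 0.0]], [[1.0, 0.0]], [3.0, 4.0], 0.1)
```

Unlike a count of misclassified examples, this quantity varies smoothly with the weights, which is what makes gradient-based optimization applicable.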

Networks of the MLP type are the most commonly used “neural nets” today, and they are usually trained
using a “backpropagation” algorithm [28]. Here a “scaled conjugate gradient” training method [29, 30, 31, 27]
has been preferred to the ubiquitous backpropagation method, speed gains of an order of magnitude being typical.
Figure 7 shows MLP class regions resulting from varying the first two inputs to a trained 8 input, 48 hidden unit
network.


6.4.2 Radial Basis Functions (RBF1 and RBF2)


Neural  nets  of  the  Radial  Basis  Functions  type  get  their  name  from  the  fact  that  they  are  built  from  radially 
symmetric  Gaussian  functions  of  the  inputs.  Actually,  the  RBF  nets  discussed  here  use  Gaussian  functions  that 
are  more  general  than  radially  symmetric  functions:  their  constant  potential  surfaces  are  ellipsoids  whose  axes  are 
parallel  to  the  coordinate  axes,  whereas  radially  symmetric  Gaussian  functions  have  spherical  constant  potential 
surfaces.  However,  the  name  Radial  Basis  Functions  has  become  customary  for  any  neural  net  that  uses  Gaussian 
functions  in  its  first  layer. 

We have experimented with RBF networks of two types, which will be denoted RBF1 and RBF2. The following
notation will be convenient:

$N^{(i)}$ = number of nodes in the $i$th layer ($i = 0, 1, 2$)
$\mathbf{c}^{(j)}$ = center vector of the $j$th hidden node ($1 \leq j \leq N^{(1)}$, $\mathbf{c}^{(j)} \in \mathbb{R}^n$) = $(c_1^{(j)}, \ldots, c_n^{(j)})$
$\sigma^{(j)}$ = width vector of the $j$th hidden node ($1 \leq j \leq N^{(1)}$, $\sigma^{(j)} \in \mathbb{R}^n$) = $(\sigma_1^{(j)}, \ldots, \sigma_n^{(j)})$
$b_i^{(k)}$ = bias weight to the $i$th node of the $k$th layer
$f(x) = 1/(1 + e^{-x})$ = sigmoid function
$w_{ij}$ = weight connecting the $i$th output node to the $j$th hidden node ($1 \leq i \leq N^{(2)}$, $1 \leq j \leq N^{(1)}$)

Each hidden node computes a radial basis function. For RBF1, these functions are unbiased exponentials

$\phi_j(\mathbf{x}) = \exp\left(-r^2(\mathbf{x}, \mathbf{c}^{(j)}, \sigma^{(j)})\right)$,

and for RBF2, they are of the sigmoidal form with a hidden-node bias weight $b_j^{(1)}$.

For either type of RBF, the discriminant function is the following function of the radial basis functions:

$D_i(\mathbf{x}) = f\left( b_i^{(2)} + \sum_{j=1}^{N^{(1)}} w_{ij}\, \phi_j(\mathbf{x}) \right)$.

The centers $\mathbf{c}^{(j)}$, widths $\sigma^{(j)}$, hidden-node bias weights $b_j^{(1)}$ (RBF2 only), output-node bias weights $b_i^{(2)}$, and
output-node weights $w_{ij}$ may be collectively thought of as the trainable “weights” of the RBF network. Training
starts with the cluster means (from a “K-means” algorithm applied to the prototype set) as the center
vectors $\mathbf{c}^{(j)}$. The width vectors $\sigma^{(j)}$ are set to a single tunable positive value. More sophisticated methods of
determining RBF parameters may be found in [32] [33]. The output layer weights are set such that each output
node is connected with a positive weight to hidden nodes of its class (that is, hidden nodes whose initial center
vectors are means of clusters from its class), and connected with a negative weight to hidden nodes of other
classes. Training proceeds by optimization identical to that described for the MLP. Figure 8 shows RBF1 class
regions resulting from the use of up to 6 hidden nodes per class, and Figure 9 shows RBF2 class regions for the
same numbers of hidden nodes per class.

Figure 8: RBF1 Classification regions for increasing numbers of centers per class.
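The RBF1 forward computation and the initial output weights can be sketched as follows. The sketch assumes the hidden nodes compute the exponential of the negated width-normalized squared distance to their centers and that the output nodes apply a sigmoid; the function names, the weight `magnitude`, and the toy centers are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def r2(x, center, width):
    """Squared distance between x and the center, normalized per coordinate by the width."""
    return sum(((xi - ci) / si) ** 2 for xi, ci, si in zip(x, center, width))

def rbf1_discriminants(x, centers, widths, out_w, out_b):
    """Hidden nodes: exp(-r2). Output nodes: sigmoid of bias plus weighted hidden values."""
    phi = [math.exp(-r2(x, c, s)) for c, s in zip(centers, widths)]
    return [sigmoid(b + sum(w * p for w, p in zip(row, phi)))
            for row, b in zip(out_w, out_b)]

def initial_output_weights(center_classes, n_classes, magnitude=1.0):
    """Positive weight to hidden nodes of the output's own class, negative to the others."""
    return [[magnitude if cls == i else -magnitude for cls in center_classes]
            for i in range(n_classes)]

# Centers would come from per-class cluster means; toy values here.
centers = [(0.0, 0.0), (4.0, 0.0)]
center_classes = [0, 1]
widths = [(1.0, 1.0), (1.0, 1.0)]  # a single tunable positive value everywhere
out_w = initial_output_weights(center_classes, 2)
values = rbf1_discriminants((0.0, 0.0), centers, widths, out_w, [0.0, 0.0])
```

Even before any optimization, this initialization already favors the class whose center is nearest, which gives the subsequent training a sensible starting point.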


Figure 9: RBF2 Classification regions for increasing numbers of centers per class.

6.4.3 Probabilistic Neural Net (PNN)

This classifier was proposed in a 1990 paper by Specht [34]. Each training example becomes the center of a kernel
function which takes its maximum at the example and recedes gradually as one moves away from the example
in feature space. An unknown $\mathbf{x}$ is classified by computing, for each class $i$, the sum of the values of the class-$i$
kernels at $\mathbf{x}$. Many forms are possible for the kernel functions; we have obtained our best results using radially
symmetric Gaussian kernels. The resulting discriminant functions are of the form

$D_i(\mathbf{x}) = \sum_{j=1}^{M_i} \exp\left( -\frac{1}{2\sigma^2}\, d^2(\mathbf{x}, \mathbf{x}_j^{(i)}) \right)$,

where $\sigma$ is a scalar “smoothing parameter” that may be optimized by trial and error. Figure 10 shows the PNN
class regions resulting from the use of $\sigma$ values of 0.25, 1.00, and 5.00. Notice that a small $\sigma$ value produces very
complex class regions similar to those of 1-NN, and that as $\sigma$ is increased, the regions become simpler.
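The influence of the smoothing parameter can be illustrated with a short sketch of the kernel sum. The Gaussian scaling by twice the squared smoothing parameter is an assumption taken from the usual form of this kernel; the function name and the toy prototypes are hypothetical.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance d^2(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pnn_discriminants(x, prototypes, labels, n_classes, sigma):
    """Sum, per class, of Gaussian kernels centered on that class's training examples."""
    scores = [0.0] * n_classes
    for p, cls in zip(prototypes, labels):
        # Assumed kernel scaling: exp(-d^2 / (2 sigma^2)).
        scores[cls] += math.exp(-sq_dist(x, p) / (2.0 * sigma ** 2))
    return scores

# Three class-0 prototypes surround the unknown; one class-1 prototype is closest.
prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]
labels = [0, 0, 0, 1]
unknown = (0.5, 0.5)

sharp = pnn_discriminants(unknown, prototypes, labels, 2, 0.05)
smooth = pnn_discriminants(unknown, prototypes, labels, 2, 5.0)
```

With the small smoothing value only the nearest prototype contributes appreciably, so the decision matches 1-NN; with the large value every prototype contributes almost equally and the more numerous class wins.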


7 Class  Finder  and  Rejector 

The  “Class  Finder”  module  is  a function  mapping  the  discriminant  values  to  a single  index  indicating  hypothesized 
class.  In  the  most  elementary  case  the  index  of  the  maximum  indicates  the  class.  The  function  may  be  more 
complicated;  discriminant  values  associated  with  the  same  class  may  be  combined  in  some  weighted  voting. 

The Rejector module produces a confidence value; it is a function of the discriminant values

$\alpha(D_1(\mathbf{x}), \ldots, D_N(\mathbf{x})) : \mathbb{R}^N \to \mathbb{R}$

and it quantifies the assertiveness of the classifier for the unknown $\mathbf{x}$. The simplest use of $\alpha$ is to subtract from
it some predefined threshold value, $\alpha_0$. A negative result implies rejection of the hypothesized class of $\mathbf{x}$. This
mechanism allows exemplars to remain unclassified with the intent of achieving a lower substitutional error rate
on those testing vectors whose classifications are accepted. This study only considers the simple strategy of
using the maximum discriminant value as the confidence. An error rate versus rejection rate plot is obtained by
selecting one confidence rule and varying the threshold $\alpha_0$. High thresholds cause rejection of more examples but
lower error rates on those accepted.
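The class finder and the thresholded rejector can be sketched together. The confidence is taken to be the maximum discriminant value, as in this study, and an unknown is rejected when the confidence minus the threshold is negative; the function names and the toy results are assumptions.

```python
def class_finder(discriminants, cluster_labels):
    """Return (hypothesized class, confidence) using the maximum discriminant value."""
    best = max(range(len(discriminants)), key=lambda i: discriminants[i])
    return cluster_labels[best], discriminants[best]

def error_reject_point(results, threshold):
    """results holds (hypothesis, confidence, true class) triples.

    Returns (rejection rate, error rate on the accepted examples).
    """
    accepted = [(h, t) for h, conf, t in results if conf - threshold >= 0.0]
    reject_rate = 1.0 - len(accepted) / len(results)
    errors = sum(1 for h, t in accepted if h != t)
    error_rate = errors / len(accepted) if accepted else 0.0
    return reject_rate, error_rate

results = [(1, 0.9, 1), (2, 0.8, 2), (3, 0.4, 5), (7, 0.6, 7)]
# Sweeping the threshold traces out an error versus rejection profile.
profile = [error_reject_point(results, t) for t in (0.0, 0.5)]
```

In this toy case the single misclassified example also has the lowest confidence, so raising the threshold removes it and the error rate on the accepted examples falls to zero.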

Figure 10: PNN Classification and Confidence Maps in 2 dimensions for increasing $\sigma$ values. From left: class
boundaries, highest discriminant value, difference in highest two discriminant values.


8 Comparison of the Classifiers

8.1 Accuracy

For each class $i$, the number of the 2314 class-$i$ test digits that were correctly classified is denoted by
$c_i$. Clearly, $c_i/2314$ may be used as an estimate of the conditional probability of correct classification of
a print given that the actual class of the print is $i$. The mixture probability of correct classification is then
the a priori weighted sum of the by-class probabilities. Table 1 shows for each classifier the corresponding
estimated probabilities of error, expressed as percentages, for increasing dimensionality of the K-L feature set.
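
Written out, the estimate described above takes the following form. The symbols $P(i)$ for the a priori
probability of class $i$, and $\hat{P}_c$ and $\hat{P}_e$ for the mixture estimates, are notation assumed here
for illustration; the per-class correct counts and the 2314 test digits per class are from the text.

```latex
\hat{P}_c = \sum_{i=0}^{9} P(i)\,\frac{c_i}{2314},
\qquad
\hat{P}_e = 1 - \hat{P}_c
```

If the priors were taken as equal, $P(i) = 1/10$, the estimate $\hat{P}_c$ would reduce to the total number of
correctly classified digits divided by 23140.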

Note that the optimal number of features (shown in bold) is not the same for all classifiers; the parametric
classifiers, QMD and NRML, are noticeably more parsimonious in the number of features required. It is also
apparent that most of the classifiers attain a plateau as the number of features reaches approximately 32,
thereafter gaining only several tenths of a percent. The best classifiers are the computationally expensive
nearest neighbor classifiers and their relative, PNN. They achieve one third fewer errors than the neural
networks and parametric classifiers. The optimum value of $\alpha = 1.1$ for WSNN corresponds to a 1-NN scheme
for most test patterns. Accordingly, k-NN is seen to have a higher error rate for increasing k.

Figure  11  gives  the  error  versus  rejection  profiles  for  the  classifiers. 

Classifier    Percent error, in order of increasing K-L feature dimension
              ... 6.7 ... 6.7 6.6 ... 5.4 5.1 5.2 5.4
              ... 26.2 ... 6.3 5.1 5.0 4.8 4.9 5.1 5.1 5.2 5.3 5.6 5.6 5.8 5.8 5.9
              ... 23.6 ... 5.8 4.9 ... 4.7 4.9 ... 5.0 ... 5.7
              ... 25.5 ... 5.0 ... 4.5 4.9 5.0 5.3 5.5 ... 6.5 6.9 7.2 7.6
NRML          26.1 9.9 6.3 5.1 5.0 4.8 4.9 5.0 5.0 5.2 5.3 5.5 5.6 5.5 5.5 5.6

Table 1: Dependence of Classification Error on KL Transform Dimensionality. Given with the classifier acronym
are: for k-NN the value of k, for WSNN the value of $\alpha$, for PNN the value of $\sigma$, for MLP networks the
number of hidden units, for RBF networks the number of centers per class, and for EMD and QMD classifiers the
number of clusters per class. Bold type indicates the dimensionality yielding minimum error for each classifier.


8.2 Computational Requirements

High training costs for a classifier can hinder experimentation, and a sufficiently large training expense can
ultimately preclude the use of such a classifier. Of the algorithms described here, for a fixed-size training
set, no classifier took more than three hours of workstation time. The neural networks are notoriously expensive
to train, whereas the nearest neighbor methods, including PNN, require no training. Once off-line training is
complete, the classification rate is of more interest. The dominant term in the recognition rate is not
classification time but the cost of dimensionality-reducing feature extraction. Pure classification rates,
excluding the K-L transform time, range from 13 characters per second (cps) for the neighbor classifiers (KNN,
WSNN and PNN) through 80 cps for two-cluster QMD to 130 cps for RBF and 250 cps for the MLP, all on a serial
workstation. Nevertheless, timing comparisons are particularly difficult and the reader is encouraged to first
consider algorithmic complexity.
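
One rough way to consider algorithmic complexity before timing is to count the multiply-add operations needed
per classified feature vector. The sketch below is an illustration under stated assumptions and not a model of
the implementations timed above: the function names, the single-hidden-layer form of the MLP, and the prototype,
hidden-unit and feature counts are all hypothetical.

```python
def neighbor_ops(n_prototypes: int, n_features: int) -> int:
    """Multiply-adds for a brute-force distance to every stored prototype."""
    return n_prototypes * n_features

def mlp_ops(n_features: int, n_hidden: int, n_classes: int = 10) -> int:
    """Multiply-adds for one forward pass through a single hidden layer (biases ignored)."""
    return n_features * n_hidden + n_hidden * n_classes

# Hypothetical sizes, chosen only to show how the two costs scale.
nn_cost = neighbor_ops(n_prototypes=10_000, n_features=32)  # grows with the prototype set
mlp_cost = mlp_ops(n_features=32, n_hidden=48)              # independent of training set size
ratio = nn_cost / mlp_cost
```

Counts of this kind ignore memory access and the cost of evaluating nonlinearities, which is one reason measured
rates need not track operation counts closely.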


9 Future  Work 

The focus in academia and commerce on OCR research is now migrating toward the more difficult problem of
recognition of structured documents. At its simplest this involves segmentation and recognition of text fields.
The processes may be tightly coupled, as in the case of recognition of multiple objects or cluttered field OCR.
The Image Recognition Group recognizes that digit OCR is essentially a solved problem for many applications.
However, work will continue into upper and lower case recognition as it applies to text field processing. There
is an emphasis on investigation of algorithms that conserve computational resources as required by commercial
products. In particular, the relationship between dimensionality, prototype set size, feature type and
computational expense is a candidate for investigation. For example, the preservation of performance while
reducing neighbor prototype set size is of obvious interest.

Figure 11: Error versus Rejection, log e(r). The log of the classification error percent of accepted patterns as a
function of the low confidence classification rejection percentage. The initial gradients of the curves e'(r) range
from -0.62 to -0.47.


10  Summary  and  Conclusions 

We have performed numerous experiments for digit OCR using “statistical” and “neural net” classification of K-L
features. The result is that the EMD and QMD parametric classifiers are able to compete with the popular MLP
and RBF neural architectures presented here. The lowest error rate classifier, PNN, as described here is more akin
to the KNN and WSNN non-parametric neighbor methods in terms of error rate and computational expense than
to the other neural network schemes. The authors may be contacted by email at jerry@magi.ncsl.nist.gov
and patrick@magi.ncsl.nist.gov.